
65 research outputs found

A hardware mechanism to reduce the energy consumption of the register file of in-order architectures

Author: Ayala Rodrigo José Luis
López Barrio Carlos Alberto
López Vallejo Marisa
Veidenbaum Alexander
Publication venue: 'Inderscience Publishers'
Publication date: 01/01/2008

This paper introduces an efficient hardware approach to reduce register file energy consumption by putting unused registers into a low-power state. Bypassing the register fields of the fetched instruction to the decode stage allows the identification of the registers required by the current instruction (instruction predecode) and allows the control logic to turn them back on. They are returned to the low-power state after the instruction has used them. This technique achieves an 85% energy reduction with no performance penalty.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital UPM

DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

Author: Abraham Danny
Givargis Tony
Heddes Mike
Nicolau Alexandru
Nunes Igor
Veidenbaum Alexander
Vergés Pere
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/05/2023

Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of predicting links. However, with the increasing amount of data to be processed, calculating the exact similarity between all pairs can be intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are indeed used in applications such as document deduplication and recommender systems where large volumes of data need to be processed. Given the importance of these tasks, the demand for advancing estimators is evident. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimate errors, and analyze its empirical performance. Our experimental results indicate that DotHash is more accurate than the other estimators in link prediction and duplicate document detection, with the same complexity and similar comparison time.
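The abstract does not spell out how the estimator is built, so the following is only a minimal sketch of one way a dot-product-based intersection estimator can work. It assumes each element is mapped to a pseudo-random vector of +1/-1 entries and each set is summarized as the sum of its element vectors. The function names, the `dim` and `seed` parameters, and the +1/-1 construction are illustrative assumptions, not the authors' implementation.

```python
import random


def element_vector(element, dim, seed=0):
    # Assumption: a deterministic pseudo-random +1/-1 vector per element,
    # derived from a string seed so equal elements get equal vectors.
    rng = random.Random(f"{seed}:{element!r}")
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def sketch(items, dim, seed=0):
    # A set sketch is the element-wise sum of its members' vectors.
    total = [0.0] * dim
    for item in set(items):
        for i, value in enumerate(element_vector(item, dim, seed)):
            total[i] += value
    return total


def estimate_intersection(sketch_a, sketch_b):
    # Shared elements contribute exactly `dim` to the dot product, while
    # distinct elements contribute zero in expectation, so dividing by
    # `dim` gives an unbiased estimate of the intersection size.
    dot = sum(x * y for x, y in zip(sketch_a, sketch_b))
    return dot / len(sketch_a)


def estimate_jaccard(a, b, dim=4096, seed=0):
    # Jaccard index from the estimated intersection and the exact set sizes.
    inter = estimate_intersection(sketch(a, dim, seed), sketch(b, dim, seed))
    union = len(set(a)) + len(set(b)) - inter
    return inter / union if union else 0.0


if __name__ == "__main__":
    a = set(range(0, 30))
    b = set(range(20, 50))  # true intersection size is 10
    print(estimate_intersection(sketch(a, 4096), sketch(b, 4096)))
    print(estimate_jaccard(a, b))
```

Larger values of `dim` reduce the variance of the estimate at the cost of bigger sketches; the sketches themselves can be computed once per set and reused across all pairwise comparisons.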

arXiv.org e-Print Archive

Enhancing the Privacy of Machine Learning via faster arithmetic over Torus FHE

Author: Alexander Veidenbaum
Alexandru Nicolau
Marc Titus Trifan
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 18/05/2023

The increased popularity of Machine Learning as a Service (MLaaS) makes the privacy of user data and network weights a critical concern. Torus FHE (TFHE) offers a solution for privacy-preserving computation in a cloud environment by allowing computation directly over encrypted data. However, software TFHE implementations of ciphertext-ciphertext multiplication, needed when both input data and weights are encrypted, are either lacking or too slow. This paper proposes a new way to improve the performance of such multiplication by applying carry-save addition. Its theoretical speedup is proportional to the bit width of the plaintext integer operands. This also speeds up multi-operand summation. A speedup of 15x is obtained for 16-bit multiplication on a 64-core processor, when compared to previous results. Multiplication also becomes more than twice as fast on a GPU if our approach is utilized. This leads to much faster dot product and convolution computations, which combine multiplications and a multi-operand sum. A 45x speedup is achieved for a 16-bit, 32-element dot product and a ~30x speedup for a convolution with a 32x32 filter size.
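To illustrate the carry-save addition the abstract refers to, here is a minimal sketch on ordinary non-negative Python integers. It does not use TFHE or any encryption library; it only shows how three operands are reduced to two without propagating carries, and how that reduction supports multi-operand sums and shift-and-add multiplication. The function names and the `bits` parameter are illustrative assumptions.

```python
def carry_save_step(a, b, c):
    # Reduce three operands to two whose sum is unchanged:
    # the bitwise sum without carries, and the carries shifted left by one.
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def multi_operand_sum(operands):
    # Repeatedly apply carry-save steps until at most two values remain,
    # then perform a single ordinary (carry-propagating) addition.
    values = list(operands)
    while len(values) > 2:
        a, b, c = values.pop(), values.pop(), values.pop()
        values.extend(carry_save_step(a, b, c))
    return sum(values)


def multiply(x, y, bits=16):
    # Shift-and-add multiplication: one partial product per set bit of y,
    # all summed with the carry-save reduction above.
    partial_products = [x << i for i in range(bits) if (y >> i) & 1]
    return multi_operand_sum(partial_products)


if __name__ == "__main__":
    print(multi_operand_sum([1, 2, 3, 4, 5]))
    print(multiply(12345, 54321))
```

The point of the design is that each carry-save step uses only bitwise operations with no carry chain, so the expensive carry-propagating addition happens once at the end instead of once per operand.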

Cryptology ePrint Archive

A distributed processor state management architecture for large-window processors

Author: Cristal Kestelman Adrián
Galluzzi Marco
González Isidro
Ramírez Marco Antonio
Valero Cortés Mateo
Veidenbaum Alexander V.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008

Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace the re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to imprecise processor state recovery on mis-predicted branches and exceptions, and to re-execution of correct-path instructions after state recovery. It also requires large register files, complicating renaming, allocation, and release of physical registers. This paper proposes a new processor architecture called a Multi-State Processor (MSP). The MSP does not use check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. The MSP uses a novel register management architecture that allows implementation of large register files with simpler and more scalable register allocation, renaming, and release. It is also key to the precise processor state recovery mechanism. The MSP is shown to improve IPC by 14%, on average, for integer SPEC CPU2000 benchmarks compared to a check-pointing based mechanism ([2]) when a fast and simple branch predictor is used. With a very aggressive branch predictor the IPC improvement is 1%, on average, and 3% if some of the programs are optimized for the MSP. The MSP also reduces the average number of executed instructions by 16.5% (12% for the aggressive branch predictor), mostly due to precise state recovery. This improves the MSP processor energy efficiency even though it uses a larger register file.

CiteSeerX

UPCommons. Portal del coneixement obert de la UPC

Instruction Cache Prefetching Using Multilevel Branch Prediction

Author: Alexander V. Veidenbaum
Publication date: 01/01/1997

This paper presents an instruction cache prefetching mechanism capable of prefetching past branches in multiple-issue processors. Such processors at high clock rates often use small instruction caches which have significant miss rates. Prefetching from secondary cache can hide the instruction cache miss penalties, but only if initiated sufficiently far ahead of the current program counter. Existing instruction cache prefetching methods are strictly sequential and cannot do that due to their inability to prefetch past branches. By keeping branch history and branch target addresses we predict a future PC several branches past the current branch. We describe a possible prefetching architecture and evaluate its accuracy, the impact of the instruction prefetching on performance, and its interaction with sequential prefetching. For a 4-issue processor and a cache architecture patterned after the DEC Alpha-21164 we show that our prefetching unit can be more effective than sequential prefetching.

CiteSeerX

Decoupled Access DRAM Architecture

Author: Alexander V. Veidenbaum
K. A. Gallivan

This paper discusses an approach to reducing memory latency in future systems. It focuses on systems where a single chip DRAM/processor will not be feasible even in 10 years, e.g. systems requiring a large memory and/or many CPUs. In such systems a solution needs to be found to DRAM latency and bandwidth as well as to inter-chip communication. Utilizing the projected advances in chip I/O bandwidth we propose to implement a decoupled access-execute processor where the access processor is placed in memory. A program is compiled to run as a computational process and several access processes with the latter executing in the DRAM processors. Instruction set extensions are discussed to support this paradigm. Using multi-level branch prediction the access processor stays ahead of the execute processor and keeps the latter supplied with data. The system reduces latency by moving address computation to memory and thus avoiding sending addresses to memory by the computational processor. This and the fetch-ahead capabilities of the access processor are combined with multiple DRAM "streaming" to improve performance. DRAM caching is assumed to be used to assist in this as well.

CiteSeerX

Stride-directed Prefetching for Secondary Caches

Author: Alexander V. Veidenbaum
Sunil Kim
Publication venue: IEEE Computer Society
Publication date: 01/01/1997


CiteSeerX